A Machine Learning Model Ensemble for Mixed Power Load Forecasting across Multiple Time Horizons

The increasing penetration of renewable energy sources tends to redirect the power systems community’s interest from the traditional power grid model towards the smart grid framework. During this transition, load forecasting for various time horizons constitutes an essential electric utility task in network planning, operation, and management. This paper presents a novel mixed power-load forecasting scheme for multiple prediction horizons ranging from 15 min to 24 h ahead. The proposed approach makes use of a pool of models trained by several machine-learning methods with different characteristics, namely neural networks, linear regression, support vector regression, random forests, and sparse regression. The final prediction values are calculated using an online decision mechanism based on weighting the individual models according to their past performance. The proposed scheme is evaluated on real electrical load data sensed from a high voltage/medium voltage substation and is shown to be highly effective, as it results in R2 coefficient values ranging from 0.99 to 0.79 for prediction horizons ranging from 15 min to 24 h ahead, respectively. The method is compared to several state-of-the-art machine-learning approaches, as well as a different ensemble method, producing highly competitive results in terms of prediction accuracy.


Introduction
The modernization of the communication infrastructure of the electrical grid, featuring smart sensors, IoT, and edge computing [1], as well as the deregulation of the electric power markets, has enabled the proliferation of distributed generation, mainly from renewable energy sources (RES) [2]. This new paradigm, while actualizing the much sought-after decarbonization of energy generation, has jeopardized the stability of the distribution network due to the intermittency of the aforementioned resources, mainly in isolated power grids, such as non-interconnected islands. The effect of this intermittency is twofold: From the distribution system operator perspective, uncertainty in RES generation compromises the ability to effectively plan short-term unit commitment scheduling [3][4][5][6], while from the energy market bidder perspective, stochasticity severely constrains bidding strategy and thus, reduces profit margins [7]. An important development in the field of energy transactions is the participation of the energy market in the distribution grid through ancillary services, which is expected to be established in the upcoming years [8].
These shortcomings underline the importance of the application of effective electric load prediction models in the context of multiple operational aspects of the smart grid, such as power stability and security. Especially in the case of micro-grids, storage management is critical and cannot be accomplished without the aid of accurate short-term load forecasts for load shifting and balancing operations [9,10]. Moreover, the grid extension and the aforementioned models is to be expected taking into account the undesirable characteristics of the load forecasting problem, which include non-linearities and high levels of noise in the associated data. Furthermore, load time series are not statistically static [64], due to the volatile, rapidly changing nature of the weather conditions that affect their power generation component. Different classes of machine-learning methods can cope better with some of these issues but usually underperform with respect to others, e.g., linear models are more robust to noise but cannot capture the non-linearities present in the load forecasting problem. To make things worse, though all of these problems are inherent to load forecasting, their mixture composition changes depending on the time horizon one tries to predict for, making it impossible to single out a unique machine-learning method that could outperform the others across different prediction horizons, e.g., linear methods are often found to perform better in short-term horizons, where data tend to be noisy, but the non-linearities can usually be adequately approximated by linear models, but mostly fail in longer time horizons, where the role of the non-linearities is dominant. It should be noted that the previous observation about the inability of a single method to beat all the others is not only tied to the context of load forecasting, but reveals a more generic concept in machine-learning and optimization, as expressed by the "no free lunch" theorem [65].
To remedy this predicament, one could resort to using a multi-model approach [66], combining multiple machine-learning methods. Unfortunately, in a real-time deployment scenario, an important practical consideration arises for multi-model schemes: How does one select the most suitable model from a pool of trained models for the next prediction timestep? One solution is to employ a rule-based decision system that uses a priori available knowledge, such as the time of day and measured weather conditions at the substation level. This presents a significant impediment. Not only are the rules of such a system difficult to conceptualize, but they also offer no guarantee of continuously optimal model selection. Doing away with a decision system altogether is also problematic since the individually generated predictions do not offer any actionable insight by themselves. Whether a distribution system operator technician or a RES aggregator, a practitioner requires a single forecast value in order to develop their operations strategy. A practical workaround is to discard such selection rules and instead employ a weighting system that assesses models only by using their past prediction performance [67]. The weighting of the output results of basic forecasting LSTM models in [68] is based on the similarity degree between target and identified standard values of load consumption. Two different approaches for determining the weights of multiple forecasters are followed in [69,70], using a novel incremental ensemble weight updating strategy and the minimum-error method, respectively. Alternatively, an extreme learning machine can be employed for combining the outputs of a pool of forecasts, as in [71]. An intelligent decision-making support scheme, including predictive performance evaluation, model properties analysis, structure and fusion strategy optimization, and optimal model preference selection, is incorporated with an evolutionary ensemble learning method proposed in [72] for shortterm load forecasting (STLF) problems. Finally, an automated system is established in [73] based on hidden Markov chains for extracting similar day profiles to obtain the best model from a library of available forecasting models. Differently from the previous works, the output neural network (NN) models result from multiple training cycles based on snapshots [74] or the hidden features of a Random Vector Functional Link network [75].
It has become clear that the necessity of providing mixed load forecasts, and indeed for multiple short-term horizons, is a factor of paramount importance in the upcoming transition to smart electricity grids. Moreover, according to the preceding literature review, it is evident that in order to enhance the predictive capability of a model, it should incorporate more than one machine-learning methodologies, which of course should be able to handle the complex dynamic behavior of the mixed load. Finally, such a methodology is necessary to be applicable in an online implementation, which means that the final predictions should be provided in a reasonable amount of time and respond to the behavior of the load through a dynamic decision mechanism.
Sensors 2023, 23, 5436 5 of 25 Realizing the aforementioned requirements and seeking to fill the corresponding research gaps, in this work, we present a novel forecasting scheme that is able to efficiently address the diverse and adverse characteristics of the load forecasting problem for various prediction horizons. The proposed method seeks to create an ensemble of prediction models based on multiple machine-learning techniques comprising different beneficial characteristics that have only been used individually for load forecasting before. Indeed, the sparse coding method introduced in the proposed model has been published very recently and used for the first time in ensemble schemes. As the participating techniques excel in different aspects of the load forecasting problem, their combined usage introduced in this work provides the ensemble with the ability to outperform each individual method in all the horizons tested. In order to efficiently combine the different machine-learning techniques, the proposed method employs an error-based metric on a rolling window of past predictions. This approach enhances the novelty of the proposed method as it does away with the adversity exhibited by complex, rule-based model selection systems. By combining the beneficial characteristics of the aforementioned techniques, the proposed scheme demonstrates superior performance in terms of prediction accuracy, compared to all the submodels, as well as a recently proposed MLP model ensemble from the literature [76], through a wide range of different prediction horizons, spanning from 15 min to 24 h-ahead. Thus, reliable forecasts can be obtained for: (a) One hour ahead or less, which are valuable for various applications at the transmission and distribution network, (b) one day ahead, contributing to the scheduling of generation sources and (c) intra-day forecasting, so as to achieve better optimization results. As a result, the introduced model ensemble can become a powerful tool for administrators and participants in the energy market, easily exploitable in both operational and managerial tasks of smart grids. It should be noted that, at least to the authors' best knowledge, no machine-learning approach that is able to handle this range of prediction horizons has been proposed in the literature. Furthermore, the proposed approach expands the existing literature by using mixed power-load data, i.e., data that include renewable generation measurements. Although there is an abundance of work in forecasting the net power load, the literature on mixed-load forecasting is very scarce. It should be pointed out that the employment of mixed measurements is aligned with the requirements of modern smart grids, where the penetration of renewable resources is a key feature.
The paper is structured as follows: Section 2.1 provides a short description of the different ML methods exploited for building the proposed pool of models. In Section 2.2, the proposed approach is presented analytically and then follows the application of the multi-model scheme upon a certain case study in Section 3, where information about the data and the training process and finally results are given in Section 4. Subsequently, in Section 5, the obtained results are discussed and explained. Finally, conclusions and guidelines for future work are outlined in Section 6.

Machine-Learning Methods Short Description
As mentioned earlier, multiple machine-learning methods are involved in the proposed approach. In this subsection, a short description of each one of these methods is provided. Here, we provide a short description of each one of them.

Linear Regression
LR is considered a standard method for addressing problems such as time series prediction, outlier detection, reliability analysis, and feature selection. The regression analysis method is basically a curve-fitting problem. Given a training dataset, (y n , x n ), y n ∈ R, x n ∈ R l , n = 1, 2, . . . , l, where y n ∈ R represents the output or dependent variable and x n ∈ R l , n = 1, 2, . . . , l represents the input vector or regressor [77], the aim is to find a function, f , which fits the data. Subsequently, when an unknown data point x * appears, we can use this function in order to calculate/predict the respective output y * [19]. Equation (1) describes the relation between the input and output variables.
The objective of a regression problem is the estimation of regression coefficients vector θ which arises through the solution of a least squares problem.
Although more modern and advanced methods have been developed, LR is still used due to its simplicity and robustness, which are of great importance, especially in online implementation of load forecasting. However, the inability of the method to extract the non-linear behavior of the load is an important disadvantage.

Sparse Coding
When addressing a linear system, for example x = Dα, x ∈ R m×n , α ∈ R k×n , the number of predictors, p, can be extremely large. Thus, it is impossible to fit a linear model when p < m, or even when p ≈ m without overfitting (depending on the noise level), but it may still be possible to fit a sparse linear model that only depends on a reduced number of predictors s, where s < p. The dictionary D ∈ R m×k is underdetermined, and therefore, the linear system has infinite possible solutions. The sparse regression (SR) problem is defined as the search for the sparsest solution, i.e., the one with the fewest non-zeros, and is described by Equation (3) [78]: where a 0 is the 0 norm, which counts the non-zero components of a. Although the problem in question is NP-hard, it can often be solved using approximation methods, such as greedy algorithms: Orthogonal matching pursuit [79], thresholding algorithm [80], or relaxation algorithms such as basis pursuit [81] are commonly used for this purpose. SR is a methodology of low complexity that carries the disadvantage of linear correlation assumption between mixed load features, but on the other hand is able to prevent overfitting compared to more complicated ML approaches. A key point stage in the sparse representation procedure is the so-called dictionary learning, which consists of finding the elements of the dictionary (atoms). Dictionary learning can be formulated as a joint unconstrained optimization problem [82], given in the form of (4).
where C = D ∈ R m×k s.t.∀j = 1, . . . , k, d T j d j ≤ 1 , denotes the feasible space of dictionary D. The role of the parameter λ is to regulate the sparsity level of the coefficient vector.

Support Vector Regression
The next method to be included in the pool of models constitutes an extension of support vector machines in regression and is called SVR. The basic idea of this methodology is the use of a non-linear transformation ϕ(·) : R n → R n h that maps the real data into a multi-dimensional space [87] and the subsequent application of LR. According to this approach, a linear function s is supposed to exist in the multi-dimensional space, which models the non-linear relation between the input and output data of the initial space [88]. Such an equation is given in (5).
where ϕ(x) denotes the kernel function and w T ∈ R n h , b ∈ R are the regression coefficients. The problem of calculating the variables w and b is reduced to the minimization of the structural risk functional.
where y contains the real measurements and C is a penalty term used to balance between data fitting and overfitting. The employment of the kernel trick allows SVR to acknowledge the presence of non-linearity in mixed load series. However, when a separating hyperplane in a given dimension cannot be found, then it is required to move in a higher dimension. In this case, the computational cost will increase as well. Furthermore, the use of support vectors makes the method sensitive to noisy data and outliers.

Neural Networks
Neural networks (NNs) constitute an important family of black-box modeling techniques. NNs are very accurate, robust, fault-tolerant, and flexible to adapt to any process given a suitable number of quality data. The proposed model ensemble includes two representative NN architectures, namely the MLP and the RBF. MLPs identify the process dynamics and form the model by guiding the input data through weighted successive layers of non-linear functions (threshold, sigmoid, etc.) called activation functions or nodes. The input (activity) µ l (x) to each one of the L nodes is the weighted sum of all N input variables to that node. µ l (x) = x·w (7) where w are the weights corresponding to each variable of the input vector x. The intermediate layers between the input and the output are called hidden layers. The schematic of a typical MLP NN with 2 hidden layers is presented in Figure 1. The prediction produced by an MLP is the weighted sum of the final layer outputs. Due to the existence of non-linear characteristics in mixed-power load data, MLPs certainly seem like a promising method for the problem at hand.  (5) where denotes the kernel function and ∈ ℝ , ∈ ℝ are the regression coefficients. The problem of calculating the variables and is reduced to the minimization of the structural risk functional.
where contains the real measurements and is a penalty term used to balance between data fitting and overfitting.
The employment of the kernel trick allows SVR to acknowledge the presence of nonlinearity in mixed load series. However, when a separating hyperplane in a given dimension cannot be found, then it is required to move in a higher dimension. In this case, the computational cost will increase as well. Furthermore, the use of support vectors makes the method sensitive to noisy data and outliers.

Neural Networks
Neural networks (NNs) constitute an important family of black-box modeling techniques. NNs are very accurate, robust, fault-tolerant, and flexible to adapt to any process given a suitable number of quality data. The proposed model ensemble includes two representative NN architectures, namely the MLP and the RBF. MLPs identify the process dynamics and form the model by guiding the input data through weighted successive layers of non-linear functions (threshold, sigmoid, etc.) called activation functions or nodes. The input (activity) to each one of the nodes is the weighted sum of all input variables to that node.
where are the weights corresponding to each variable of the input vector . The intermediate layers between the input and the output are called hidden layers. The schematic of a typical MLP NN with 2 hidden layers is presented in Figure 1. The prediction produced by an MLP is the weighted sum of the final layer outputs. Due to the existence of non-linear characteristics in mixed-power load data, MLPs certainly seem like a promising method for the problem at hand. On the downside, MLP training is usually performed by some form of backpropagation technique which usually requires more than one intertwined iterative procedure to On the downside, MLP training is usually performed by some form of backpropagation technique which usually requires more than one intertwined iterative procedure to fully optimize the involved parameters, i.e., the number of layers and nodes, the weights, etc. Depending on the size and architecture of the MLP network, and the input space, this procedure may become computationally intensive, and thus it is commonly performed offline. A quite critical drawback of all backpropagation-based techniques is that they get easily trapped in local minima. In this case, the provided solution may not be satisfactory, a fact which leads to a tedious retraining procedure.
RBFs are similar to the MLPs in the sense that data are fed through the input layer and follow a straight path to the output layer but they differ in that there exists only one hidden layer which comprises radially symmetric activation functions (Gaussian, quadratic, thin plate spline, etc.). A typical RBF NN using Gaussian activation functions can be seen in Figure 2. The input layer distributes the data of the N inputs to the L nodes of the hidden layer, which are positioned to a specific point of the input space through a process of training. The activity µ l (u k ) of each node is calculated using the Euclidian distance between the input data u k and the center c l of each node.
where K is the number of training data. The chosen RBF g k receives the activity value and calculates the node output. The linear combination of all hidden layer node outputs provides the NNs predictionŷ kŷ k = g k ·w (9) Sensors 2023, 23, x FOR PEER REVIEW 8 of 26 fully optimize the involved parameters, i.e., the number of layers and nodes, the weights, etc. Depending on the size and architecture of the MLP network, and the input space, this procedure may become computationally intensive, and thus it is commonly performed offline. A quite critical drawback of all backpropagation-based techniques is that they get easily trapped in local minima. In this case, the provided solution may not be satisfactory, a fact which leads to a tedious retraining procedure.
RBFs are similar to the MLPs in the sense that data are fed through the input layer and follow a straight path to the output layer but they differ in that there exists only one hidden layer which comprises radially symmetric activation functions (Gaussian, quadratic, thin plate spline, etc.). A typical RBF NN using Gaussian activation functions can be seen in Figure 2. The input layer distributes the data of the inputs to the nodes of the hidden layer, which are positioned to a specific point of the input space through a process of training. The activity of each node is calculated using the Euclidian distance between the input data and the center of each node.
where is the number of training data. The chosen RBF receives the activity value and calculates the node output. The linear combination of all hidden layer node outputs provides the NNs prediction The training algorithm for an RBF NN is usually broken down into two phases, the first of which discovers the optimal number and location of the hidden node centers in the input space, while the second one calculates the weights usually through simple LR. Due to the fact that the training process is broken into two phases, RBF NNs are able to use very fast algorithms. In fact, some of the current RBF training techniques [89] are deterministic and non-iterative, requiring only a single pass of the data to converge, in contrast to other NN architectures which (a) are epoch-based requiring multiple passes of the data and (b) are stochastic requiring multiple runs to overcome their sensitivity to initial conditions. RBF networks provide very strong interpolation tools, usually outperforming other NN-based techniques provided that dense and good quality training data are available. In the absence of adequate training data though, their performance may become rather poor. Therefore, their application, in combination with the other models of the pool, can make a significant contribution to mixed-power load prediction. The training algorithm for an RBF NN is usually broken down into two phases, the first of which discovers the optimal number and location of the hidden node centers in the input space, while the second one calculates the weights w usually through simple LR. Due to the fact that the training process is broken into two phases, RBF NNs are able to use very fast algorithms. In fact, some of the current RBF training techniques [89] are deterministic and non-iterative, requiring only a single pass of the data to converge, in contrast to other NN architectures which (a) are epoch-based requiring multiple passes of the data and (b) are stochastic requiring multiple runs to overcome their sensitivity to initial conditions. RBF networks provide very strong interpolation tools, usually outperforming other NN-based techniques provided that dense and good quality training data are available. In the absence of adequate training data though, their performance may become rather poor. Therefore, their application, in combination with the other models of the pool, can make a significant contribution to mixed-power load prediction.

. Random Forests
The last method involved in the proposed approach is RF [90]. As the name suggests, an RF is a tree-based ensemble, with each tree depending on a collection of random variables. More formally, for a p-dimensional random vector X = X 1 , . . . , X p T representing the real-valued input or predictor variables and a random variable Y representing the real-valued response, we assume an unknown joint distribution P XY (X, Y). The goal is to find a prediction function f (X) for predicting Y. The prediction function is determined by a loss function L(Y, f (X)) and defined to minimize the expected value of the loss where the subscripts denote expectation with respect to the joint distribution of X and Y.
It penalizes values of f (X) that are a long way from Y. As for simple LR, squared error loss could be a typical choice of L. Ensembles construct f in terms of a collection of linear estimators of x, the so-called "base where J denotes the number of trees and is user-specified, according to the following iterative procedure.
sample D j of size N is extracted from D and a corresponding tree h j X, Θ j is derived implementing the binary recursive partitioning [91]. The prediction extraction using a standard RF regressor is depicted in Figure 3. For each unsplit node of the tree, the best binary split among all binary splits on the m ∈ (1, p) predictors, is found. The component Θ j is used to inject randomness first by bootstrap sampling and second by the random subset of m predictors. Once the base learners are found, a prediction at a new point x is given by Sensors 2023, 23, x FOR PEER REVIEW 9 of 26

Random Forests
The last method involved in the proposed approach is RF [90]. As the name suggests, an RF is a tree-based ensemble, with each tree depending on a collection of random variables. More formally, for a -dimensional random vector , … , representing the real-valued input or predictor variables and a random variable representing the real-valued response, we assume an unknown joint distribution , . The goal is to find a prediction function for predicting . The prediction function is determined by a loss function , and defined to minimize the expected value of the loss , where the subscripts denote expectation with respect to the joint distribution of and . Intuitively, , is a measure of how close is to . It penalizes values of that are a long way from . As for simple LR, squared error loss could be a typical choice of . Ensembles construct in terms of a collection of linear estimators of , the so-called "base learners" ℎ , ℎ , … , ℎ , where denotes the number of trees and is userspecified, according to the following iterative procedure. Let , , . . , , denote the training data, with , , … , , , 1, … . For each ∈ 1, , a bootstrap sample of size is extracted from and a corresponding tree ℎ , is derived implementing the binary recursive partitioning [91]. The prediction extraction using a standard RF regressor is depicted in Figure 3. For each unsplit node of the tree, the best binary split among all binary splits on the ∈ 1, predictors, is found. The component is used to inject randomness first by bootstrap sampling and second by the random subset of predictors. Once the base learners are found, a prediction at a new point is given by RF is a simple and reliable forecasting tool. Its main limitation is the trade-off between performance and the number of trees. Increasing this parameter can lead to more accurate predictions and prevent overfitting but can also make the algorithm too slow and RF is a simple and reliable forecasting tool. Its main limitation is the trade-off between performance and the number of trees. Increasing this parameter can lead to more accurate predictions and prevent overfitting but can also make the algorithm too slow and ineffective for real-time predictions. It is, therefore, understandable that this methodology may prove suitable in specific areas of the dataset.

Machine-Learning Model Ensemble
Recognizing the individual advantages and disadvantages of the machine-learning methods described in Section 2.1, the proposed scheme seeks to create an ensemble that will successfully combine their merits in a single approach. For example, neural-networkbased models such as RBF do exhibit superior prediction performance only as long as the input data point lies well within the domain of the input training dataset, otherwise it fails. On the other hand, linear and sparse prediction models, in general, show much better extrapolative performance, even though they are unable to capture more complex, non-linear dynamics. In other words, by toggling between the robust linear models and the more sensitive but also more effective non-linear ones, a superior approach to load time series prediction can be constructed.
In order to obtain the best possible performance of each sub-model, their optimal training configuration has to be determined. Starting with the simpler methods used, a linear and an SR model are trained by least squares and fast iterative shrinkage thresholding algorithm, respectively, the latter being a faster implementation of the corresponding iterative shrinkage thresholding algorithm used for load forecasting [36]. In the case of the sparse coding approach, sparsity is induced by the 2 norm and the regularization parameter was set by trial and error to 0.01. Subsequently, a random forest regressor is employed, where the number of decision trees is selected to be 15 so as to keep the training time at a reasonable level without reducing its predictive ability. As regards the nonlinear methods, an SVR model with Gaussian kernel function was developed [92], using sequential minimal optimization for training and Bayesian optimization to optimize the model's hyperparameters [93]. Two NN models are also introduced, featuring two different architectures. The first one is a two-layered MLP network trained by the Levenberg-Marquardt backpropagation algorithm [94], following a 10-fold cross-validation. The neurons of each layer are chosen by trial and error as 20 and 10. It is noted that, in order to compensate for the performance dependence of the MLP training methods to initialization, the training procedure was conducted 10 different times, with different randomly initialized weights of the network. The second NN uses an RBF architecture and is trained using the fuzzy means technique [95], an algorithm that has found many successful applications due to the increased accuracy it provides [96][97][98] combined with fast training times [99]. In this work, the FM algorithm has been tested for a range of fuzzy sets between 4 and 15. When deployed online, the proposed approach evaluates a MAE metric on a rolling window of past predictions coming from a pool of trained models in order to create a weight vector for the next timestep prediction. An important item of the proposed method to be specified is the length of the rolling window. It can be easily inferred that this depends not only on the prediction horizon but also on the statistical properties of the predicted variable (a more volatile, non-stationary time series would require shorter rolling window horizons). Once the model pool has been populated by trained models, the optimum length of the rolling window is calculated in an exhaustive search manner over the same validation data in the range of 3-15 regressive timesteps. The proposed method operates as follows: For each timestep k, all trained models in the pool are evaluated concurrently. Their current prediction performance is assessed by applying the MAE metric on their previous predictions up to a rolling time window of length h w whereŷ i (k) are the predictions of the i-th model and y are the actual values of the times eries at timestep k. Then, the MAE metric is used to calculate the prediction weight of each model for the next timestep k + 1. where MAE i is the MAE of the i-th prediction model, N is the total number of models in the model pool, and w i is the prediction weight for the next timestep. The prediction output of the proposed method is calculated as the weighted sum of the model predictionsŷ î A snapshot of a two-model example version of the proposed method is shown in Figure 4. Note that the proposed method combines the strengths of the individual models by placing greater weight on the current better-performing model for the time window of length h w . At first, bothŷ 1 andŷ 2 models appear ineffective as individual predictors of the y time series. However, after closer inspection,ŷ 2 performs better for the first half of y, whileŷ 1 for the second half. By placing greater weight on the model with the best past prediction performance within the horizon h w , the proposed method is able to toggle towards the best available model for the current circumstance. The result is an overall superior prediction performance.
where is the MAE of the i-th prediction model, N is the total number of models in the model pool, and is the prediction weight for the next timestep. The prediction output of the proposed method is calculated as the weighted sum of the model predictions A snapshot of a two-model example version of the proposed method is shown in Figure 4. Note that the proposed method combines the strengths of the individual models by placing greater weight on the current better-performing model for the time window of length hw. At first, both and models appear ineffective as individual predictors of the time series. However, after closer inspection, performs better for the first half of , while for the second half. By placing greater weight on the model with the best past prediction performance within the horizon hw, the proposed method is able to toggle towards the best available model for the current circumstance. The result is an overall superior prediction performance. The ensemble model recognizing the superiority of over , within the rolling window adapts its weights accordingly, achieving highly accurate prediction for the next timestep k + 1.

Problem and Data Description
The main goal of this paper is to develop a methodology in order to implement a load forecasting tool able to provide accurate mixed load predictions over several different time horizons and in particular 15 min, 1-h, 2-h, 3-h, 6-h, and 24-h. This case study makes use of real data from a high voltage/medium voltage substation located in mainland Europe, measured during the years 2017-2018. The MV distribution network contains multiple photovoltaic sites. As a result, the data measurements in question constitute mixed power-load recordings, which correspond to the mixed AP demand of the distribution grid from the transmission grid. The load measurements have been recorded every minute and contain the mixed AP demand, as well as cloud coverage, wind speed, humidity, and temperature, as measured from the substation's weather station. Due to practical concerns, individual power generation or weather data from the aforementioned photovoltaic sites should not be taken into account for the creation of the input dataset since these will  The ensemble model recognizing the superiority ofŷ 1 overŷ 2 , within the rolling window adapts its weights accordingly, achieving highly accurate prediction for the next timestep k + 1.

Problem and Data Description
The main goal of this paper is to develop a methodology in order to implement a load forecasting tool able to provide accurate mixed load predictions over several different time horizons and in particular 15 min, 1-h, 2-h, 3-h, 6-h, and 24-h. This case study makes use of real data from a high voltage/medium voltage substation located in mainland Europe, measured during the years 2017-2018. The MV distribution network contains multiple photovoltaic sites. As a result, the data measurements in question constitute mixed powerload recordings, which correspond to the mixed AP demand of the distribution grid from the transmission grid. The load measurements have been recorded every minute and contain the mixed AP demand, as well as cloud coverage, wind speed, humidity, and temperature, as measured from the substation's weather station. Due to practical concerns, individual power generation or weather data from the aforementioned photovoltaic sites should not be taken into account for the creation of the input dataset since these will normally not be available for a real-life implementation. In short, in this work, we rely on the substation's historical measurements of load and weather conditions in order to create a prediction model of the mixed AP demand of the grid.

Data Preprocessing and Model Training
Unavoidably, the substation measurements contain large periods of missing or corrupt data owing to sensor downtime or malfunction. For the scope of this case study, no missing data imputation has been performed-instead, corrupted data and outlier removal was the main focus of the preprocessing operation. Due to the sheer size of the dataset, manual preprocessing was impossible, mandating the creation of a bad data detection routine. Corrupted values were decidedly easy to detect since the corresponding AP signal exhibited unusually low variance around a constant value. However, outlier values on mixed load data were a challenge to successfully handle-a review of the challenges of this topic, as well as effective techniques, is available on [100]. The chosen technique must be sufficiently effective at classifying outliers in data, while avoiding false positives. In this case study, a rolling median window threshold approach is used, as it was found to compromise well between the aforementioned points. A two-day snapshot from the application of this algorithm to raw electrical load data is presented in Figure 5. The outliers usually originate from noisy sensor readings [101]. As part of data preprocessing, a resampling step also took place, where each sample was defined as an average of 15 one-minute measurements. normally not be available for a real-life implementation. In short, in this work, we rely on the substation's historical measurements of load and weather conditions in order to create a prediction model of the mixed AP demand of the grid.

Data Preprocessing and Model Training
Unavoidably, the substation measurements contain large periods of missing or corrupt data owing to sensor downtime or malfunction. For the scope of this case study, no missing data imputation has been performed-instead, corrupted data and outlier removal was the main focus of the preprocessing operation. Due to the sheer size of the dataset, manual preprocessing was impossible, mandating the creation of a bad data detection routine. Corrupted values were decidedly easy to detect since the corresponding AP signal exhibited unusually low variance around a constant value. However, outlier values on mixed load data were a challenge to successfully handle-a review of the challenges of this topic, as well as effective techniques, is available on [100]. The chosen technique must be sufficiently effective at classifying outliers in data, while avoiding false positives. In this case study, a rolling median window threshold approach is used, as it was found to compromise well between the aforementioned points. A two-day snapshot from the application of this algorithm to raw electrical load data is presented in Figure 5. The outliers usually originate from noisy sensor readings [101]. As part of data preprocessing, a resampling step also took place, where each sample was defined as an average of 15 oneminute measurements. The task of input variable selection is closely related to the prediction horizon. All models developed in the context of this study are considered autoregressive with exogenous variables, as they use inputs that consist of previous values of the output and weather data. A set of inputs was initially constructed for each prediction horizon based on the literature. Subsequently, the contribution of these variables to the prediction accuracy improvement was examined by trial and error, sometimes leading to shorter input sets for some of the horizons. Alternatively, other approaches, such as gradient boosting decision tree and Pearson correlation coefficient [102], attention mechanism [103], or Exploratory Data Analysis [104], are considered to have an effective contribution during input features reduction and selection. However, it is important to note that for each horizon, inputs remain the same for all machine-learning methods used in the present study. The task of input variable selection is closely related to the prediction horizon. All models developed in the context of this study are considered autoregressive with exogenous variables, as they use inputs that consist of previous values of the output and weather data. A set of inputs was initially constructed for each prediction horizon based on the literature. Subsequently, the contribution of these variables to the prediction accuracy improvement was examined by trial and error, sometimes leading to shorter input sets for some of the horizons. Alternatively, other approaches, such as gradient boosting decision tree and Pearson correlation coefficient [102], attention mechanism [103], or Exploratory Data Analysis [104], are considered to have an effective contribution during input features reduction and selection. However, it is important to note that for each horizon, inputs remain the same for all machine-learning methods used in the present study.
The selected input variables which all models accept could be divided into 4 categories, as described in Table 1, namely (a) current and past AP values, (b) difference between current and past AP values, (c) average of past AP values, and (d) weather measurements. It has to be noted that p (t) values contain the current and past, average and difference measures of the AP values,p (t+s) is the output, i.e., the mixed power load s fifteen-minute intervals ahead, whereas w (t) components contain the respective weather-related inputs of cloud coverage, wind speed, humidity, and temperature, respectively.
The choice of the particular set of input variables can be justified as follows: The fact that electric load time series presents a strong dependency on previous values [47] strengthens the selection of such input variables in the form of (a). Trying to capture the trend of electrical load, differences (b) between current and previous AP values are frequently employed [31]. The implementation of past value averages (c) is also quite important, according to the literature [48]. Finally, the introduction of weather data (d) is undoubtedly an improving factor in the predictions [23,24,105]. At this point, it is important to mention that during the training stage of the forecasting model, the weather inputs w (t) are introduced as measured values of actual weather data acquired at the t time index. On the contrary, in an online implementation of the model, future weather data w (t+s) will be unknown and replaced by weather predictions, therefore introducing additional uncertainty.
Once the preprocessing stage has been completed and input variables have been selected, the dataset was partitioned in a yearly manner in order to select the training datasets. At this point, an important consideration should be made. As mentioned in the introductory section, the load time series consists of a load and generation component. The statistical properties of both of these components are not static in relation to time, especially on a long-term scale. The network physically expands, incorporating more consumers as well as RES generators, each with different load and generation profiles, respectively. Therefore, it makes sense to select training datasets as close to the actual prediction interval as possible. Since the available data concern two successive years, the data corresponding to 2017 were selected as the training subset, and the data corresponding to 2018 were selected as the testing dataset. A point worth mentioning is that no permutation step is taking place before training. This means that the data used for testing are considered completely unseen for the proposed model, yielding a more reliable forecasting model. Due to confidentiality reasons, the real and predicted mixed load values have been normalized in order to be presented. Finally, it should be noted that models that require a validation step during training, namely models based on MLP and RBF NNs, do so using 10-fold cross-validation, while in the case of models that require multiple training runs for each training seed (see MLP), the best-performing model on the validation data is kept. An overview of the implementation of the proposed model is provided in Figure 6, which illustrates, in the form of a block diagram, the entire sequence of steps that take place, starting from the acquisition of the raw AP data from the substation to the derivation of the final forecasts. It has to be highlighted that this figure is generic and does not refer to a particular prediction horizon. 10-fold cross-validation, while in the case of models that require multiple training runs for each training seed (see MLP), the best-performing model on the validation data is kept. An overview of the implementation of the proposed model is provided in Figure 6, which illustrates, in the form of a block diagram, the entire sequence of steps that take place, starting from the acquisition of the raw AP data from the substation to the derivation of the final forecasts. It has to be highlighted that this figure is generic and does not refer to a particular prediction horizon. Figure 6. Overview of the proposed model ensemble. Its application in mixed load forecasting comprises a series of steps, i.e. raw data acquisition, data preprocessing, collection of input variables, splitting of the dataset in a training and a testing subset, training of submodels, generation of the next AP forecast by each submodel, weighting of the individual predictions, and, lastly, calculation of the next AP final forecast.
At this point, it should be mentioned that in order to evaluate the accuracy of the proposed method, it was considered appropriate to compare it with a model ensemble from the literature. To be more specific, we employed a method proposed for load forecasting based on an ensemble of multiple MLP neural networks [76]. Consequently, following the experimental protocol described in this work, a number of feed-forward NNs, with a single hidden layer, were trained on 14 different random initializations of the weights. For each initialization, the number of neurons in the hidden layer ranged from 3 to 50. The hyperbolic tangent sigmoid function was selected as the transfer function among the NNs' layers, while all NNs were trained using the resilient backpropagation algorithm. The neural networks were arranged in ascending order with respect to the MAPE error on a common validation set, which, in this case, was defined as 20% of the training dataset. Then, the networks corresponding to the first 5 MAPE errors were selected, and the final forecasts were obtained by averaging the individual forecasts of these 5 models. Figure 6. Overview of the proposed model ensemble. Its application in mixed load forecasting comprises a series of steps, i.e. raw data acquisition, data preprocessing, collection of input variables, splitting of the dataset in a training and a testing subset, training of submodels, generation of the next AP forecast by each submodel, weighting of the individual predictions, and, lastly, calculation of the next AP final forecast.
At this point, it should be mentioned that in order to evaluate the accuracy of the proposed method, it was considered appropriate to compare it with a model ensemble from the literature. To be more specific, we employed a method proposed for load forecasting based on an ensemble of multiple MLP neural networks [76]. Consequently, following the experimental protocol described in this work, a number of feed-forward NNs, with a single hidden layer, were trained on 14 different random initializations of the weights. For each initialization, the number of neurons in the hidden layer ranged from 3 to 50. The hyperbolic tangent sigmoid function was selected as the transfer function among the NNs' layers, while all NNs were trained using the resilient backpropagation algorithm. The neural networks were arranged in ascending order with respect to the MAPE error on a common validation set, which, in this case, was defined as 20% of the training dataset. Then, the networks corresponding to the first 5 MAPE errors were selected, and the final forecasts were obtained by averaging the individual forecasts of these 5 models.

Results
In this section, the results of extensive simulations of the proposed model are presented. A set of scatterplots is shown in Figure 7a-f, representing the actual versus the predicted values mixed load values for 1, 2, 3, 6, and 24-h-ahead horizons, respectively, through the whole testing dataset. The diagonal line implies a complete match between real values and forecasts. The axes are presented in units of normalized AP.

Results
In this section, the results of extensive simulations of the proposed model are presented. A set of scatterplots is shown in Figure 7a-f, representing the actual versus the predicted values mixed load values for 1, 2, 3, 6, and 24-h-ahead horizons, respectively, through the whole testing dataset. The diagonal line implies a complete match between real values and forecasts. The axes are presented in units of normalized AP. Additional results are provided in Table 2, which contains information about the forecasting performance of the proposed method in comparison to the individual machine-learning methods comprising the model pool. In order to distinguish the results for different prediction time horizons, the table is divided into sections. The accuracy of model predictions is evaluated through the correlation coefficient (R 2 ), RMSE and MAE, considering them as representative and efficient criteria [106]. For comparative reasons, the table also contains the values of the indices for all submodels, as well as their percentage of ranking in the first place. This quantity, labeled as "Rank 1" in Table 2, denotes how many times each submodel scored the 1st rank among all submodels, i.e., achieved the lowest MAE. Additional results are provided in Table 2, which contains information about the forecasting performance of the proposed method in comparison to the individual machinelearning methods comprising the model pool. In order to distinguish the results for different prediction time horizons, the table is divided into sections. The accuracy of model predictions is evaluated through the correlation coefficient (R 2 ), RMSE and MAE, considering them as representative and efficient criteria [106]. For comparative reasons, the table also contains the values of the indices for all submodels, as well as their percentage of ranking in the first place. This quantity, labeled as "Rank 1" in Table 2, denotes how many times each submodel scored the 1st rank among all submodels, i.e., achieved the lowest MAE.
The aforementioned form of ranking of the submodels can be seen graphically in Figure 8. More specifically, each one of Figure 8a-f refers to 15 min, 1, 2, 3, 6, and 24-h prediction horizons, respectively. Each one of these subfigures contains 6 pie charts, denoting 1st to 6th rank for the models. To be more specific, each pie chart shows the percentages corresponding to how many times each submodel ranked in the respective place, according to its weighted MAE. For example, the 2nd pie of Figure 8a implies that for 15 min-ahead forecasting, the MLP submodel ranked in the 2nd place among all models with a percentage of 17%, the SR submodel with a percentage of 21%, etc. Finally, analytical graphs are provided for each prediction time horizon, with Figure 9a1-f1 to depict forecasts of 15 min, 1, 2, 3, 6, and 24 h-ahead, respectively, where a randomly chosen 12-h time window (from 09:00 to 21:00) of real AP values and the respective predictions are shown for an arbitrarily chosen day belonging to the testing subset (the same day and the same window is used for all horizons). These graphs are accompanied by Figure 9a2-f2, which indicates which submodel has the largest weight for every predicted data point using a bar plot. graphs are provided for each prediction time horizon, with Figure 9a1-f1 to depict forecasts of 15 min, 1, 2, 3, 6, and 24 h-ahead, respectively, where a randomly chosen 12-h time window (from 09:00 to 21:00) of real AP values and the respective predictions are shown for an arbitrarily chosen day belonging to the testing subset (the same day and the same window is used for all horizons). These graphs are accompanied by Figure 9a2-f2, which indicates which submodel has the largest weight for every predicted data point using a bar plot.    (e) (f)

Discussion
In the context of the case study, multiple experiments were conducted, and the results are explained and discussed here. At this point, we should point out that providing accurate predictions is indeed a challenging task due to both grid and data-related reasons. First, the system's expandability can be a limiting factor for the accuracy of future forecasts. At the same time, this is reinforced by inherent characteristics of the load time series, such as non-linearity and uncertainty. In the face of these challenges, the proposed method seems to be quite effective, providing reliable predictions. From Figure 7a-f, we can draw conclusions about the quality of predictions. When the prediction time horizon is too short (Figure 7a), the forecast error is distributed close to the diagonal line, which implies quite accurate predictions. While we are trying to increase the prediction horizon, the forecasts are getting less accurate (Figure 7b-f), as obviously, the pairs of real and predicted values are scattered further from the ideal line.

Discussion
In the context of the case study, multiple experiments were conducted, and the results are explained and discussed here. At this point, we should point out that providing accurate predictions is indeed a challenging task due to both grid and data-related reasons. First, the system's expandability can be a limiting factor for the accuracy of future forecasts. At the same time, this is reinforced by inherent characteristics of the load time series, such as non-linearity and uncertainty. In the face of these challenges, the proposed method seems to be quite effective, providing reliable predictions. From Figure 7a-f, we can draw conclusions about the quality of predictions. When the prediction time horizon is too short (Figure 7a), the forecast error is distributed close to the diagonal line, which implies quite accurate predictions. While we are trying to increase the prediction horizon, the forecasts are getting less accurate (Figure 7b-f), as obviously, the pairs of real and predicted values are scattered further from the ideal line.
Looking at Table 2, we observe that the proposed model outmatches all individual submodels, and the competitive MLP model ensemble in terms of MAE, and R 2 and RMSE. Moreover, this conclusion applies to all prediction time horizons. As the prediction horizon gets longer, the forecasting error increases, which is absolutely reasonable. The only exceptions are the R 2 and RMSE values obtained by the MLP model ensemble for 2 h prediction horizon, which slightly exceeds those of the proposed model. However, these differences cannot be considered significant as they are marginal, while on the other hand, the corresponding value of the MAE index clearly favors the proposed method. A result worth mentioning is the improvement of the multi-model performance over the current best sub-model that occurs in most cases while the horizon is getting longer. More specifically, the reduction of MAE that the proposed approach achieves over the best of the individual models ranges from 0.03411 to 0.3156. Such an improvement in performance could be partly explained by the occurrence of uncertainty in the load time series. As the prediction horizon is getting longer, the level of uncertainty is also increased, which is better addressed by the ensemble model than each individual submodel alone.
Regarding the efficiency of the individual models of the pool, the results of MAE, RMSE, and R 2 show that there is not just one model to prevail over the others in all cases. For the shorter prediction horizons and, more specifically, up to 3 h, LR and SR appear to achieve marginally smaller forecasting errors than their non-linear counterparts. Although the non-linearities are an intrinsic characteristic of mixed load [107], this behavior becomes more apparent as the prediction horizon is getting longer. As a result, models which are based on LR are able to provide robust results for very short-term forecasts. On the other hand, one major advantage of neural networks is their capability of modelling non-linear systems. An important observation is that neural networks appear to perform better for longer prediction horizons, and this can be attributed to the fact that, as the prediction horizon is getting longer, the non-linear properties of the load are becoming more dominant. Therefore, when predictions for longer horizons are required, MLP neural networks take the lead. However, the same does not apply to RBF networks. As stated above, in order for RBF networks to perform well, dense and suitable data are required. Consequently, their performance is reduced for 24-h prediction horizons, where the input information is poorer due to the resampling process. Although the remaining models of the pool, SVR and RF networks, present a moderate predictive capability, they contribute positively to the overall performance of the proposed model. This conclusion confirms our claim of the need to use multiple models in order to enhance the reliability of load predictions.
Looking at the results of actual and predicted values in Figure 9 we confirm that, as the time horizon increases, accurate load forecasting becomes more and more difficult. Continuing with the subfigures of Figure 9 that show the alternation between submodels in order to maintain the accuracy of predictions, we conclude that the weighting mechanism of the proposed model seems to perform adequately regardless of the time horizon. It can easily be seen that quite reliable forecasts are obtained during the steady rise or fall of the actual AP values. On the contrary, predictions become less accurate when the AP presents great fluctuation.
Several quite interesting conclusions can also be drawn from the pie charts in Figure 8. Each percentage in the pies represents the degree to which the respective model yielded the highest weight or equivalently the lowest MAE. The highest percentages of the first rank (above 18%) belong to MLP, RBF, and RF, and this applies for all horizons except that of 24 h, where SVR takes the place of RF. RF, in particular, scores lower MAE most of the time when the prediction horizon does not exceed 3 h. Beyond that point, RBF neural networks outperform the rest of the submodels. An interesting observation is that the aforementioned models have equally high percentages in the sixth rank. Thus, these methods either achieve very good or poor performance. This observation is quite significant and strongly enhances the usefulness and effectiveness of our proposed method. The percentages of the rest of the pool models are, in most cases, divided into the intermediate rankings, with the exception of the high percentage of SVR in the sixth rank for the 6-h horizon.

Conclusions and Prospects
Achieving reliable electric load forecasts is of paramount importance for the smooth operation of electric power grids. However, the intrinsic volatility of the electric load makes its prediction particularly hard. Therefore, we assume that the load behavior is influenced by multiple input variables, which differ depending on the data to be predicted. The mixed load forecasting task has been addressed by a variety of machine-learning methods, and it has been observed that none is able to provide equally accurate results for any testing dataset. In the present study, a mixed power-load forecasting model is introduced, which employs the predictions coming from several individual models, namely MLP and RBF neural networks, LR, SVR, RF, and SR. These forecasts are weighted based on how accurate they have been and then added to calculate the final forecast value. The proposed model provides predictions for different time horizons, spanning from 15 min to 24 h. The extended results presented using real data sensed from a high voltage/medium voltage substation show the superiority of this novel approach compared to all the individual models as well as an MLP model ensemble for every prediction horizon tested. Thus, the proposed multi-model forecasting scheme constitutes a powerful method capable of greatly enhancing the operation of the modern electricity grid, with potential practical applications in network planning, operation, and management.
It should be noted here that a limitation of the present study is that it did not involve predictions for long-term horizons. Although investigating longer prediction horizons is outside the scope of this work, we believe the proposed model ensemble could serve as the basis for designing such a tool. On the other hand, it is quite probable that a different set of input variables, presenting higher correlation with the long-term evolution of the mixed load would be needed in this case.
Driven by the remarkable performance of the proposed methodology in mixed load forecasting, its application could be extended to other critical sectors of the smart grid, such as forecasting the electricity price and the production from RES, in order to more efficiently schedule conventional sources. A fruitful application would also be to forecast the residential demand or the aggregated load corresponding to several substations. Another promising direction for future research towards this direction includes the integration of Graph Neural Networks, which have been proved to be a promising candidate due to their ability to successfully interpret spatiotemporal features of the input data.  Data Availability Statement: The data are not publicly available due to confidentiality and privacy reasons.